Get Data

library(tidyverse)

a = read_csv('raw_payroll_data.csv')
a %>% head
a %>% glimpse
## Rows: 104
## Columns: 5
## $ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ dept_id   <dbl> 6, 4, 3, 4, 2, 6, 3, NA, 3, 6, 6, 6, 4, 4, 6, 3, 5, 4, 4,...
## $ dept_name <chr> "Production", "Business Development", "Design", "Business...
## $ name      <chr> "B****, Li*** N.", "C******, Jane G.", "E******, Ma* B.",...
## $ salary    <chr> "91100", "173900", "163100", "71000", "111100", "80600", ...
# simple cleaning
#a = a %>% select(-id, -name)

a = a %>% mutate(
  salary = as.numeric(salary),
  dept_id = as.factor(dept_id)
)

# dept_id should be a factor/categorical variable;
# especially important to convert before any imputation is done

a %>% head

Exploratory Data Analysis

Check zeros, na, inf

a %>% visdat::vis_dat()

a %>% funModeling::status()
a %>% DataExplorer::plot_missing()

Check counts, levels

a %>% 
  count(dept_name, dept_id, name = 'count') %>% 
  mutate(percent = count/sum(count)) %>% 
  arrange(dept_name, -percent)
a %>% select(dept_name) %>% funModeling::freq()
a %>% count(name) %>% arrange(-n)

Some dept names have been misspelled, and some names are duplicated.

Correct dept name misspellings & remove duplicate names

# correct dept misspellings
a = a %>% mutate(
  dept_name = if_else(dept_name == 'Business Developement', 'Business Development', dept_name),
  dept_name = if_else(dept_name == 'Producdion', 'Production', dept_name),
) %>% mutate(
  dept_name = factor(dept_name)
)

#check
a %>% select(dept_name) %>% funModeling::freq()
# remove rows with duplicate names
a = a %>% distinct(name, .keep_all = TRUE)

#check
# n_distinct(a$name) / nrow(a)  # should equal 1 after dedupe

Data Imputation

provided mapping reference

(dept.mapping = tibble(
  dept_id = a %>% pull(dept_id) %>% unique() %>% sort,
  dept_name = c('Human Resources', 'Design', 'Business Development', 'Accounting', 'Production'),
))

Note!! I could use the provided dept mapping (replicated above) to perfectly impute the missing data for both dept_id and dept_name via some clever find-and-replace.

But . . . there’s a faster and more automated way! Imputation via machine learning, specifically the random forest algorithm (both classification and regression).
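For reference, the find-and-replace alternative mentioned above can be done with two lookup joins against the mapping table. Below is a minimal, self-contained sketch using a made-up toy tibble in place of the real payroll data (the names `toy` and `toy.mapping` are illustrative only):

```r
library(dplyr)

# toy data standing in for `a`: one row missing dept_id, one missing dept_name
toy = tibble(
  id        = 1:3,
  dept_id   = c(2, NA, 5),
  dept_name = c('Design', 'Accounting', NA)
)

# the provided mapping (same shape as dept.mapping above)
toy.mapping = tibble(
  dept_id   = c(2, 4, 5),
  dept_name = c('Design', 'Accounting', 'Production')
)

# fill dept_id from dept_name, then dept_name from dept_id, via two left joins
toy.filled = toy %>% 
  left_join(toy.mapping, by = 'dept_name', suffix = c('', '.map')) %>% 
  mutate(dept_id = coalesce(dept_id, dept_id.map)) %>% 
  select(-dept_id.map) %>% 
  left_join(toy.mapping, by = 'dept_id', suffix = c('', '.map')) %>% 
  mutate(dept_name = coalesce(dept_name, dept_name.map)) %>% 
  select(-dept_name.map)

toy.filled  # both NAs are now filled from the mapping
```

`coalesce()` keeps the original value where it exists and falls back to the mapped value only where it was NA, so already-complete rows are untouched.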

missing data reference

# these rows are missing for dept_name
a %>% filter(is.na(dept_name))
missing.dept_name.ids = a %>% filter(is.na(dept_name)) %>% pull(id)

# these rows are missing for dept_id
a %>% filter(is.na(dept_id))
missing.dept_id.ids = a %>% filter(is.na(dept_id)) %>% pull(id)

# these rows are missing for salary
a %>% filter(is.na(salary))
missing.salary.ids = a %>% filter(is.na(salary)) %>% pull(id)

actual imputation of dept_id, dept_name, salary

library(tidymodels)

rec = a %>% recipe(salary ~ .) %>% 
  # note: step_bagimpute() was renamed step_impute_bag() in newer versions of recipes
  step_bagimpute(dept_id, dept_name, impute_with = imp_vars(dept_id, dept_name)) %>% 
  step_bagimpute(salary, impute_with = imp_vars(dept_id, dept_name))

a.imputed = rec %>% prep %>% juice
#a.imputed %>% DataExplorer::plot_missing()

proof of perfect Random Forest Imputation

original reference

dept.mapping

observations with imputed dept name

a.imputed %>% filter( id %in% missing.dept_name.ids )

observations with imputed dept id

a.imputed %>% filter( id %in% missing.dept_id.ids )

observations with imputed salary

a.imputed %>% filter( id %in% missing.salary.ids )

avg and median salary by department

a.imputed %>% group_by(dept_name) %>% summarise(
  mean.salary = mean(salary),
  median.salary = median(salary)
)
## `summarise()` ungrouping output (override with `.groups` argument)

Observation: For individuals who had missing ‘salary’ info, the random forest algorithm replaced the missing salary with (approximately) the median salary of the department they belonged to.

Determining Statistical Significance of Differences

visualize salary distributions by department

library(plotly)

a.imputed %>% 
  plot_ly(x = ~dept_name, y = ~salary, color = ~dept_name) %>%
  add_boxplot(quartilemethod = 'inclusive') %>% 
  hide_legend() %>% 
  layout(
    title = 'Distribution of Salary by Department',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    )

tabulate salary quartiles by department

a.imputed %>% group_by(dept_name) %>% summarise(
  q1 = quantile(salary, 0.25),
  q2 = quantile(salary, 0.50),
  q3 = quantile(salary, 0.75),
)
## `summarise()` ungrouping output (override with `.groups` argument)

Is there a significant difference between the mean salaries of the departments?

anova.salary.dept = aov(salary ~ dept_name, a.imputed)
anova.salary.dept %>% tidy

Conclusion: Yes, the p-value is less than 0.05 (5%).

Even Further Investigation

For which specific pairwise comparisons is there a sig diff?

list all pairwise comparisons

TukeyHSD(anova.salary.dept)

list only statistically significant pairwise comparisons

TukeyHSD(anova.salary.dept) %>% 
  tidy %>% 
  filter(adj.p.value < 0.05) %>% 
  select(contrast, adj.p.value)

Clearly, there is financial mismanagement / fraudulent activity occurring in the Design department.

Looking at the salary-distribution-by-department visualization above, these results make sense.

The ‘Design’ dept clearly has a median pay above that of the other departments. Even more telling, its first quartile, at $103,250, comes close to exceeding the medians of three other departments.